Teacher quality is widely recognized as influencing student achievement and success in school. In this 
article, we consider various approaches to the assessment of teacher quality, including process-product 
observational measures, evaluation checklists, professional standards, large-scale surveys, and com- 
mercially available observation systems. We present examples of each from the special education lit- 
erature, consider teacher education research genres for which each is appropriate, and evaluate each 
using a set of criteria that incorporates both practical and technical considerations. We advocate for 
multimethod approaches to teacher quality research and for more research relating what teachers know 
and do to what students learn, and we note that a stronger link between teachers and learners would 
allow for more rigorous evaluations of teacher preparation. 


Drawing on analyses of a subset of teacher quality studies in 
which student outcomes were used as a dependent variable 
(Hess, 2001; Walsh, 2001), the No Child Left Behind Act 
(NCLB; 2001) challenged traditional concepts of good teach- 
ing by emphasizing content mastery and verbal ability and 
downplaying the importance of pedagogy. In turn, the belief 
that pedagogy is a less powerful determinant of student 
achievement than content mastery has led policymakers to 
propose alternatives to traditional teacher preparation. Thus, 
the NCLB encourages states to develop routes that move 
teachers into classrooms on “a fast-track basis” and includes 
in its definition of “highly qualified teachers” individuals en- 
rolled in such alternative routes. The NCLB draws no distinc- 
tion between secondary and elementary teachers, or between 
general and special education teachers, in spite of the fact that 
content mastery would seem to have relatively less relevance 
to the effectiveness of elementary and special education teach- 
ers and that pedagogy would have relatively more. Arguably, 
for most special educators, pedagogy would seem far more 
important as a determinant of achievement than mastery of the 
content they teach, which often involves basic skills. 

Special education alternative routes are proliferating 
(Rosenberg & Sindelar, 2005), most probably in response to 
chronic shortages of special education teachers and the NCLB 
requirement that all teachers be highly qualified by the 2006- 
2007 school year. Furthermore, in a recent national survey of 
special education alternative programs, Rosenberg, Boyer, 
Sindelar, and Misra (in press) identified a small subset that 
has adopted NCLB-like, fast-track approaches as well. In programs
of this sort, participants are provided limited training,
in spite of their need for strategies for coping with significant
learning and behavior problems. Widespread development of 
alternative routes and the existence of even these few fast-track 
alternatives led Rosenberg et al. to conjecture that special edu- 
cation has entered an era in which traditional standards for 
teacher preparation have given way to pragmatism. It is im- 
portant to note that this transition has occurred in spite of 
limited empirical research on the efficacy of preparation al- 
ternatives, including traditional routes, and the equivocal find- 
ings that existing research has yielded (Nougaret, Scruggs, & 
Mastropieri, 2005; Sindelar, Daunic, & Rennells, 2004). 

Thus, from teacher educators’ perspective, research on 
the efficacy of alternative preparation routes seems more crit- 
ical than ever before. For one thing, a policy direction is set, 
and its disregard for pedagogical training undermines what 
teacher educators believe about the scope and rigor of effec- 
tive preparation. Policymakers also have raised the standard 
for credible evidence, so that changes to NCLB policy on 
teacher preparation and teacher quality will require not only 
more evidence but better evidence as well. Besides, to com- 
pete successfully for students in the entrepreneurial world in 
which teacher educators now work, guidance in designing ef- 
fective alternative routes is essential. However urgent these 
considerations may be, rigorous and definitive research on the 
impact of teacher preparation cannot be easily — or cheaply — 
had. Perhaps the first and most important hurdle for teacher 
education researchers to surmount is to identify a credible and 
versatile measure of teacher quality, one that will garner the 
attention of both policymakers, who have set student outcomes
as the gold standard for teacher quality, and teacher educators,
who understand the difficulty of linking what they do
first to the competence of their graduates and ultimately to the 
achievement of their graduates’ students. 

We recognize and appreciate the importance of linking 
what teachers do to what their students learn and how well 
they behave. Establishing links would allow teacher education 
researchers the opportunity to focus more specifically on link- 
ing the content of preparation to the competence of graduates. 
Measures of competence or quality (terms we use interchange- 
ably in this paper) have not been used commonly in research 
on teacher preparation, in spite of the fact that concepts of 
teacher quality have evolved through an interesting series of 
representations. 

Concepts of Teacher Quality 

Regardless of one’s purpose in doing so, defining teacher 
quality is no easy task. Reaching consensus on a definition, 
even among teacher educators and researchers, has proven 
elusive. As noted by Berliner (2005), “quality always requires 
value judgments about which disagreements abound” (p. 206). 
Definitions of high-quality teaching range in their focus from 
the actions of the teacher, to the knowledge a teacher possesses, 
to the creativity of the teacher. In recent years, however, lead- 
ing researchers (Berliner, 2005; Fenstermacher & Richardson, 
2005) have focused on the multidimensional nature of the 
concept and have defined teacher quality as encompassing 
two parts: (a) good teaching, meaning that the teacher meets 
the expectations for the role (e.g., holding degrees, using age- 
appropriate methods, upholding the standards of a field of 
study, and other attributes and practices), and (b) effective or 
successful teaching, meaning the results of the teacher’s ac- 
tions on student learning and achievement. In other words, 
one dimension in the absence of the other falls short of fully 
defining teacher quality. 

The history of research on teaching and on the qualities 
that produce great teachers is relatively short. Although there 
were early studies in the 1940s, 1950s, and into the 1960s that 
focused on personal characteristics and experience variables, 
it was not until the late 1960s that researchers turned their ef- 
forts to exploring the link between specific teacher actions and 
student learning (Cochran-Smith & Lytle, 1990; Shulman, 
1986). This process-product approach to research was based on 
behavioral psychology and child development, and although 
general education researchers initiated this line of research, 
some special education researchers conducted similar studies 
and contributed to the findings that influence teaching and 
teacher education today (Blanton et al., 2003). For example, 
effective teachers were found to (a) teach classroom rules and 
monitor expectations, (b) provide clear explanations and ample 
instructional time, (c) maximize the opportunity for students 
to respond during instruction and seatwork, (d) use a brisk 
pace to present lessons and present new material in small 
steps, and (e) provide regular feedback (Berliner, 1984; Christenson,
Ysseldyke, & Thurlow, 1989; Englert, Tarrant, &
Mariage, 1992; Good, 1979; Medley, 1978; Rosenshine, 1986; 
Shulman, 1986; Sindelar, Smith, Harriman, Hale, & Wilson, 
1986). 

In the 1970s, research began to address the complexities
of teaching, classrooms, and schools through approaches variously
termed learning-to-teach research, classroom ecology research,
or interpretive research (Berliner, 1989; Doyle, 1983;
Fenstermacher & Richardson, 2005; Kagan, 1992; Wideen,
Mayer-Smith, & Moon, 1998). The literature grew rich with
research on teacher planning/decision making (e.g., Reynolds,
1992), teacher thinking (e.g., Carter, 1990), teacher beliefs
(e.g., Pajares, 1992), and
novice versus expert teaching (Berliner, 1986), among other 
topics. As in their efforts in process-product research, special 
education researchers (e.g., Brantlinger, 1996; Fuchs, Fuchs, 
& Bishop, 1992; Nowacek & Blanton, 1996) borrowed from 
these new programs of inquiry and produced findings that 
added to the literature. 

The knowledge base on teaching and understanding 
teacher quality continues to expand and change, focusing on 
both the good and the effective or successful dimensions of 
teacher quality. Currently, accountability and performance stan- 
dards are dominating the teacher quality agenda, with ac- 
companying changes in teacher education accreditation and 
teacher licensure, which are the major quality control mech- 
anisms for the profession. The result of this focus is greater 
attention on such teacher attributes as experience, preparation 
and degrees, and certification (Rice, 2003). 

Regardless of how difficult it is to encompass the con- 
cept of teacher quality, researchers need credible measures to 
build strong research programs. As we have argued, strong re- 
search programs are necessary both to guide teacher education 
program design and to inform policy. Although large-scale 
studies of teacher education program efficacy are under way 
(or complete) in general education (Darling-Hammond, 2000; 
Fenstermacher & Richardson, 2005; Humphrey & Weschler, 
2005; National Commission on Excellence in Elementary 
Teacher Preparation for Reading Instruction, 2003), similar 
efforts are needed in special education. The purpose of this 
paper is to consider existing approaches to assessing begin- 
ning teacher quality and examine their utility for research in 
special education. 

Evaluation of Five Approaches 

We will consider five approaches to defining beginning teacher 
quality and measures associated with them: (a) process- 
product measures, (b) teacher evaluation checklists, (c) stan- 
dards, (d) large-scale surveys, and (e) commercially available 
observations. We first discuss the general problem of teacher 
assessment and the particular problem created by the use of 
student achievement as a measure of teacher quality. Although 
some consider it a gold standard (Greenwood & Maheady,
1997; Walsh, 2001), we consider alternatives, or what Kennedy
(1999) described as “approximations to indicators of student 
outcomes” (p. 345). For each model and measure, we consider 
the teacher education research genres (Kennedy, 1996) to 
which it applies and evaluate it against a set of criteria for 
technical adequacy and practicality. 

The use of student outcomes, particularly achievement, 
as a measure of teacher quality enjoys strong support from 
both education professionals (Greenwood & Maheady, 1997) 
and the policy community (Walsh, 2001). The widespread and 
ready availability of standardized achievement test scores, the 
fruit of state policy on high-stakes assessment, has fostered 
interest in their use as an outcome measure in research on 
teacher quality. Although policymakers have always been in- 
terested in the impact of their initiatives on student learning, 
previously it was difficult to generate an adequately large 
database for analysis. High-stakes assessment has changed all 
that. 

In a discussion of policy research measures, Kennedy 
(1999) commented on the difficulty of linking policy initiatives 
to student outcomes (especially when the outcome of interest 
is complex student learning) and described approximation 
methods. She argued that scores on standardized achievement 
tests, although first-order approximations of complex student 
learning, fail to represent it fully. According to Kennedy, classroom
observations, another first-order approximation, may be
better, particularly when the observations describe “the kind 
of intellectual work that teachers are asking of their students” 
(p. 346). However, observations suffer from other shortcom- 
ings. For one thing, there are no standard observation prac- 
tices; for another, due to the time that observation requires, 
typically only small samples of teacher performance are ob- 
tained. 

As a result, according to Kennedy, researchers tend to rely 
on second-level approximations, or “situated descriptions of 
teaching” (1999, p. 349). Second-level approximations include 
vignettes (with teachers’ responses) and teachers’ daily logs. 
Questionnaires and interviews constitute third-level approxima- 
tions, and personal testimonies are fourth-level approximations. 
Although each level has advantages and disadvantages, gen- 
erally, the more removed a measure is from complex student 
outcomes, the more likely it is that disadvantages will out- 
weigh advantages. The advantages associated with third- and 
fourth-level approximations may be limited to the practical 
considerations of ease of administration and low cost. At those 
levels, technical adequacy may be compromised as well. In 
Kennedy’s framework, most measures we consider in this ar- 
ticle (process-product measures, checklists, and observations) 
are first-level approximations. The representations of teacher 
quality in the Schools and Staffing Survey (SASS; n.d.) and 
the Study of Personnel Needs in Special Education (SPeNSE; 
n.d.) are third-level approximations. 

Kennedy’s argument about complex student learning is 
only one criticism of the use of standardized test scores in assessing
school or teacher quality. Of equal concern, especially
for teachers, is the relationship between previous learning and
test scores in any given year. Clearly, students who score poorly 
on standardized tests are likely to score poorly again in the fu- 
ture. Thus, teachers in classrooms with low-achieving students 
will compare unfavorably with colleagues who teach high- 
achieving students, regardless of the quality of their teaching. 
Teachers rightly complain that judgments based exclusively 
on scores from single administrations of achievement tests dis- 
advantage teachers with large numbers of low-performing stu- 
dents. 
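
The complaint can be made concrete with a small worked example. The sketch below uses invented scores for two hypothetical classrooms and contrasts a single-administration mean with a simple mean gain from the prior year. The gain score is offered only as an illustrative contrast and is not a measure proposed in this article; the classroom names, the scores, and the helper `mean_gain` are all assumptions.

```python
from statistics import mean


def mean_gain(prior, current):
    """Average change per student between two test administrations."""
    return mean(c - p for p, c in zip(prior, current))


# Hypothetical scores (not data from any study): same students, two years.
classrooms = {
    "low_prior_class": {"prior": [40, 45, 50], "current": [50, 56, 59]},
    "high_prior_class": {"prior": [80, 85, 90], "current": [82, 86, 93]},
}

for name, scores in classrooms.items():
    status = mean(scores["current"])                      # single administration
    gain = mean_gain(scores["prior"], scores["current"])  # change since last year
    print(f"{name}: current mean = {status}, mean gain = {gain}")
```

With these invented numbers, the classroom that looks weaker on the single administration shows the larger gain, which is the pattern teachers of low-achieving students have in mind.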

In special education, the problem becomes more diffi- 
cult, because classroom teachers and special educators share 
responsibility for educating most students with disabilities. 
Thus, determining which teacher is responsible for what learn- 
ing may be impossible to do with any degree of precision or 
consistency. Furthermore, special education teachers’ roles vary 
from school to school and, for some teachers, from student to 
student. A special education teacher may work with a single 
group for much of the day, work with several groups of stu- 
dents for short periods in a resource room, consult with some 
students’ classroom teachers in planning accommodations and 
adaptations, or co-teach with a classroom teacher. With the pos- 
sible exception of special education teachers in self-contained 
classes, the relationship between special education teacher 
quality and student outcomes is unclear and potentially tenu- 
ous. 

As a result, in this paper, we consider models and mea- 
sures of beginning teacher quality that are approximations of 
student outcomes. We wish a more definitive link were avail- 
able between what special education teachers do and how 
much their students learn. At the same time, we recognize the 
importance of identifying approximations that are accurate 
and credible for teachers, researchers, and policymakers alike. 
Ultimately, of course, the two research traditions must merge 
so that teacher education practices may be linked to teacher 
quality and teacher quality to student outcomes. 

Evaluation Criteria 

The six criteria that we use to evaluate the models and mea- 
sures of beginning teacher quality are utility, credibility, com- 
prehensiveness, generality, soundness, and practicality. These 
criteria are represented as U, CR, CO, G, S, and P in Table 1. 
A plus (+) indicates that the criterion is regarded as a strength, 
a minus (-) indicates a weakness, and a plus-minus (±) indi- 
cates both strengths and weaknesses. The table lists specific 
examples of five general classes of models and measures, ap- 
propriate research genres, and the criteria used to evaluate 
each model. 

Some criteria are pragmatic. For example, with regard 
to utility, we need to know whether models and measures have 
been used by other researchers. With a previously used mea- 
sure, we can benefit from colleagues’ experience, and their in- 
sight and advice may help us decide on appropriate measures 
for our own research. For practicality, also a pragmatic consideration,
we need to understand costs, training requirements,
and the developmental work required to adapt an existing
model or measure for our own purposes. All other considerations
being equal, inexpensive, easy-to-master, and readily
adaptable are preferred qualities.

TABLE 1. Models and Measures of Beginning Teacher Quality

Model                          Examples                          Research genre   Criteria used to evaluate models
                                                                 (1-5)            (U, CR, CO, G, S, P)

Process-product                COKER (Stallings, 1980)           ✓                +  ±  -  +  +

Teacher evaluation             Englert, Tarrant, & Mariage       ✓                -  ±  +  +  -
checklists                     checklists (1992)^a

                               Stanovich & Jordan (1998)^b       ✓                ±  ±  +  ±  +
                               and Haager et al. (2003)^b

Standards                      CEC Knowledge and Skills^c        ✓                ±  +  +  +  -

                               INTASC^c

Representations of teacher     SASS, SPeNSE                      ✓                +  +  +  +
quality in large-scale
surveys

Commercially available         PRAXIS III                                         ±  +  +  +  +
observation

Note. 1 = searches for factors that influence student outcomes; 2 = comparative studies of licensed and unlicensed teachers; 3 = follow-up surveys; 4 = experiments; 5 = case studies
of change over time; U = utility; CR = credibility; CO = comprehensiveness; G = generality; S = soundness; P = practicality; COKER = Classroom Observation Keyed for Effectiveness
Research; CEC = Council for Exceptional Children; INTASC = Interstate New Teacher Assessment and Support Consortium; SASS = Schools and Staffing Survey
(SASS, n.d.); SPeNSE = Study of Personnel Needs in Special Education (SPeNSE, n.d.).

^a Designed for student evaluation. ^b Designed for research. ^c Neither CEC nor INTASC currently offers an assessment process.


Some of our criteria are technical in nature. We use 
soundness to refer to the extent to which a measure is reliable 
and valid, and credibility to refer to face validity, which we 
separate from soundness to highlight its relativity and subjec- 
tivity. Although models and measures must be credible to the 
researchers using them, we are equally concerned with the 
credibility of a given model or measure for other stakeholder 
groups — most importantly, teachers, administrators, policy- 
makers, and families. In this sense, credibility may be inferred 
from what we know about how a model or measure was de- 
veloped and validated. We may infer credibility for stakeholder 
groups on the basis of the extent to which they were involved 
in the development or validation process. 

Generality and comprehensiveness refer to a model’s theo- 
retical foundation. Generality requires us to consider how well 
a single model of beginning teacher quality represents the full 
range of contexts in which a special education teacher may 
work. Does the model fairly represent the work of co-teachers, 
consulting teachers, resource room teachers, and teachers in 
self-contained classes? Does the model fairly represent the 
work of teachers of students with high-incidence disabilities
as well as students whose disabilities are more significant? 
Models that allow for comparability across contexts simplify 
the aggregation of research findings. Comprehensiveness is 
derived from the richness and breadth of the model or mea- 
sure. A better model or measure taps knowledge and disposi- 
tions in addition to skills. A better model or measure includes 
management skills, reflection, and decision making in addi- 
tion to discrete teaching performances. Finally, a better model 
or measure incorporates the work that teachers do with each 
other, families, and communities. 

Research Genres 

Kennedy (1996) described five traditions in teacher education 
research and considered for each genre the teacher education 
elements studied, the measures typically used, and the logic 
underlying each. The five genres are (a) identification of factors 
that influence student outcomes, (b) comparative studies of li- 
censed and unlicensed teachers, (c) follow-up surveys, (d) ex- 
periments, and (e) case studies of change over time. The genres 
are described in greater detail in the paragraphs that follow. 

Factors That Influence Student Learning. Studies of 
factors that influence student learning commonly use large-scale
multiple regression models to analyze the statistical relationships
between a set of predictor variables (including
teacher qualifications and teacher education variables, e.g., licensure)
and a criterion variable (e.g., reading achievement). 
Such studies focus on the “policy parameters” (Kennedy, 1996, 
p. 124) of teacher education (e.g., number of required cred- 
its), and student achievement is typically the criterion of in- 
terest. An effort is made to identify variables that contribute 
to achievement and those that do not. In spite of limitations 
with the genre, studies of factors that influence student learn- 
ing have the distinct advantage in the current policy context 
of using an achievement criterion. 
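
As a minimal sketch of the kind of model this genre relies on, and not a specification taken from any study cited here, a regression of this sort might be written as follows. The variable names (Licensed, Credits) and the vector of additional predictors are illustrative assumptions chosen to match the examples in the text.

```latex
% Illustrative only: student i taught by teacher j.
% Licensed_j and Credits_j stand in for "policy parameters" of teacher education;
% \mathbf{x}_{ij} collects any other predictors the researcher includes.
\[
  \text{Reading}_{ij}
    = \beta_0
    + \beta_1\,\text{Licensed}_{j}
    + \beta_2\,\text{Credits}_{j}
    + \boldsymbol{\gamma}^{\top}\mathbf{x}_{ij}
    + \varepsilon_{ij}
\]
```

In such a model, a coefficient that is reliably different from zero would mark a teacher education variable as one that contributes to achievement, which is the sorting of variables the genre aims for.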

In special education, researchers have few large-scale, 
nationally representative databases available to them that per- 
mit analysis of teacher quality. In education databases (gen- 
erally), the sample of special education students or teachers, 
if identifiable, is not sufficiently large to permit refined statis- 
tical analyses. For example, in the 2003 administration of the 
National Assessment of Educational Progress (Perie, Grigg, 
& Donahue, 2005), nearly 190,000 students were assessed in 
reading, but fewer than 10,000 of them were identifiable as 
students with disabilities (by virtue of requiring testing accom- 
modations). Similarly, the limited number of special educa- 
tion teachers in the SASS (n.d.) and Teacher Follow-up Survey 
(TFS; n.d.) samples has prevented Boe and his colleagues 
from computing for special education teachers the same array 
of estimates they compute for the general education sample. 
For example, in a recent study of teacher attrition and trans- 
fer, Boe, Cook, and Sunderland (2006) aggregated three SASS 
and TFS administrations but nonetheless were unable to esti- 
mate reliably the number of special education teachers who 
moved from one school to another and the number who left 
due to dissatisfaction with teaching. 

Comparisons. Comparisons of licensed and unlicensed 
teachers typically involve observations of classroom practice 
or performance on teacher assessments. Because differences 
favoring fully qualified teachers are expected, studies of this 
genre test the value of teacher preparation explicitly. How- 
ever, one problem with the logic underlying comparative stud- 
ies is that teacher education is treated as a consistent and 
uniform phenomenon, which it is not. Furthermore, compara- 
tive studies also presume substantial differences in preparation, 
although most unlicensed teachers typically have completed 
at least some teacher preparation. 

Follow-up Studies. Researchers operating within this 
genre presume that teachers themselves are reliable sources 
of information about their knowledge and skills and how these 
were acquired. Such studies may focus on components of 
teacher education and thereby allow for more precision than 
either of the first two genres, in which teacher education is 
considered to be a uniform and consistent intervention. Follow- 
ups that involve telephone, paper-and-pencil, or Web-based 
surveys can be administered widely for little cost. With large 
samples that permit stratification, teacher groups can be differentiated
on key variables (e.g., graduates of 4-year vs. 5-year
programs). Follow-up studies typically are conducted with 
graduates of a single teacher education program and are most 
useful for faculty there. 

Experiments. In experimental studies of teacher edu- 
cation, a skill is taught in different ways with different groups, 
and differences in skill performance are attributed to differ- 
ences in teacher education pedagogy. Experimental studies 
enjoy several advantages, including clear focus on teacher ed- 
ucation components and assessment of outcomes (e.g., the 
skill being taught). However, such studies focus on training 
discrete, narrowly defined skills, which are part of — but not 
the sum of — teacher quality. Absent from experimental re- 
search are cognition, reflection, and decision making, the el- 
ements thought to make effective teaching a coherent whole. 

Because special education has philosophical roots in 
positivism and special education researchers have methodological
skill in designing studies with small samples, experiments
are more common in teacher education research than
in other genres (Tulbert, Sindelar, Correa, & La Porte, 1996). 
However, the problem of tight focus on observable actions is 
evident in special education teacher education research, per- 
haps because of our deep roots in behavioral psychology. 
When experiments have been conducted, single-subject de- 
signs have been used. Although such designs allow researchers 
to demonstrate control over dependent measures, these de- 
pendent measures are limited to observations of discrete ac- 
tions. 

Case Studies. In case studies, teacher candidates are ex- 
amined at the beginning and end of their programs, and pos- 
sibly more often. Differences on these assessments are used 
to describe the process through which a teacher develops. 
Candidates’ knowledge, attitudes, and beliefs may be assessed. 
If cost were no consideration, observations of classroom prac- 
tice also could be used within this genre. In good case study 
research, theory is used to generate and organize questions 
and to suggest directions for change. 

Case study methods, more so than follow-up studies, seem 
well suited to the special education context. Good follow-up 
studies (like the SPeNSE) require large samples, which are 
hard to constitute in special education. For one thing, state- 
to-state variation in certification structure limits our ability to 
lump teachers together into a nationally representative sam- 
ple. Second, some specializations within special education 
(e.g., deaf and hard of hearing) require mastery of at least 
some content that is unique to the specialization. 

Models and Measures 

The five traditions of assessing beginning teacher quality are 
(a) empirical representations of effective practice derived 
from process-product research; (b) more complete and holistic
representations, exemplified by checklists developed by
Englert and her colleagues (1992) and others (Stanovich & 
Jordan, 1998); (c) standards; (d) representations of effective 
practice from large-scale surveys (e.g., SASS, SPeNSE); and 
(e) observation systems for classroom teachers such as the 
PRAXIS III, developed and published by the Educational 
Testing Service (ETS). We next weigh these traditions against 
our six evaluation criteria and consider the genres for which 
each would be appropriate. In doing so, we cite research in 
which specific examples of the practices were used. We urge 
readers to consider these studies illustrative and not exhaus- 
tive and to bear in mind that the measures we do consider are 
better thought of as prototypes than as exemplars. 

Process-Product Observational Measures 

In the typical process-product study, teachers are observed at 
work in their classrooms. Teaching and classroom interactions 
typically are described in a series of low-inference behavioral 
categories, often mutually exclusive and exhaustive, so that 
any event may be coded in only one way. The manner in which 
the stream of classroom events is parsed reflects an empirical 
or theoretical conception of teaching. Code frequencies or du- 
rations are aggregated across teachers and related to achieve- 
ment measures. Relationships between patterns of classroom 
performance (or interactions) and student outcomes are de- 
termined statistically. 
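
The analytic logic described above can be sketched in a few lines. The example below is an illustration under stated assumptions, not the procedure of any particular observation system: the category labels, teacher identifiers, and achievement values are invented, and a simple Pearson correlation stands in for whatever statistical model a given study would use.

```python
from collections import Counter
from statistics import mean


def code_proportions(events):
    """Share of coded intervals that fall in each behavioral category."""
    counts = Counter(events)
    return {code: n / len(events) for code, n in counts.items()}


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical observation records: each interval receives exactly one code,
# mirroring mutually exclusive and exhaustive categories.
observations = {
    "teacher_1": ["feedback", "explain", "feedback", "manage"],
    "teacher_2": ["manage", "explain", "manage", "manage"],
    "teacher_3": ["feedback", "feedback", "explain", "feedback"],
}
# Hypothetical class-mean achievement scores for the same teachers.
achievement = {"teacher_1": 62.0, "teacher_2": 48.0, "teacher_3": 71.0}

teachers = sorted(observations)
feedback_share = [
    code_proportions(observations[t]).get("feedback", 0.0) for t in teachers
]
scores = [achievement[t] for t in teachers]

# Relate the aggregated process variable to the product (achievement) measure.
print(pearson_r(feedback_share, scores))
```

Working with proportions rather than raw counts keeps teachers comparable when observation sessions differ in length; an actual study would need many more teachers and a model suited to its design.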

To illustrate, in Algozzine, Morsink, and Algozzine’s 
(1988) study of instruction in self-contained special education 
classrooms, the researchers used the Classroom Observation 
Keyed for Effectiveness Research (COKER). Medley, Coker, 
and Soar (1984) described the COKER as an objective, low- 
inference process for observing the ongoing flow of student- 
teacher interaction. Based on its history of use in general ed- 
ucation process-product research and information from the 
manual, Algozzine et al. judged the COKER to be technically 
adequate for their purposes. The system requires trained ob- 
servers who code all of the keys they observe in a given time 
period. In the COKER lexicon, keys are statements describing
discrete teacher actions, or what others might call competencies,
performances, or behaviors. For example, one key under
Learner Reinforcement and Involvement is “maintains environment
in which students are actively involved, working on
task” (Medley et al., 1984, p. 162). The COKER is a complex 
system, and Algozzine et al. used three of its Competency Di- 
mensions: (a) Instructional Strategies, Techniques, or Meth- 
ods (7 keys), (b) Communication With Learners (5 keys), and 
(c) Learner Reinforcement and Involvement (5 keys). On the 
basis of these COKER observations, Algozzine et al. reported 
that the teachers in their study performed adequately, but not 
differently, regardless of the classifications of their students. 

Process-product measures like the COKER seem well 
suited to comparison studies of licensed and unlicensed teach- 
ers and longitudinal studies of change, although using such 
complex systems would be costly and labor-intensive. These 
measures can also be used in experiments, as in Stallings’s
(1980) work on beginning reading instruction. In this study, 
teachers were assessed before and after an intervention designed 
specifically to affect how they allocated time across activities. 

We have alluded to the high cost of repeated adminis- 
trations of process-product measures and other factors that 
limit their practicality. For example, extensive training is required
for COKER observers, and the need to repeat training
over the duration of a longitudinal study further diminishes
its practicality. Furthermore, it is uncertain whether an established
system will be sensitive to the changes that a particular
program is intended to produce. The credibility of a process-product
measure like the COKER may derive from professional
consensus, research on effective teaching, or theory.
COKER keys, originally developed through professional consensus,
were validated in subsequent research (Medley et al.,
1984). Its use in special education classrooms required a leap 
of faith by Algozzine et al. (1988), but the generality of the 
system was borne out by its utility to the authors. As a limita- 
tion, conceptions of teacher quality derived from the COKER — 
or from process-product measures in general — are based on 
observations of teachers’ actions and fail to tap other dimen- 
sions of what we know to be complex performance. 

Overall, process-product measures have strengths and
weaknesses. Foremost among their strengths is the potential
for highly reliable measurement of the relationships between 
items on the observation system and key criterion variables, 
such as achievement. Among the weaknesses of process- 
product measures is the reliance on teachers’ actions to the 
exclusion of internal events available through interviews, 
logs, and other measures. Their use also may be impractical, 
particularly when extensive training is required for reliable 
administration, or when research designs necessitate repeated 
observations over time. 

Teacher Evaluation Checklists 

In 1992, Englert, Tarrant, and Mariage described a series of 
detailed, moderate-inference checklists that they had devel- 
oped for evaluating field experience students. Taken together, 
Englert et al.’s checklists constitute a rich, detailed model of 
beginning teacher quality. In this section, we consider both 
the original checklists and a research adaptation developed by 
Stanovich and Jordan (1998) for their study of the relation- 
ship of teachers’ and principals’ beliefs about inclusion and 
effective teaching practices. 

The first four checklists described by Englert et al. were 
derived from process-product relationships and covered class- 
room management (15 items), time management (10 items), 
lesson presentation (27 items), and seatwork management 
(9 items). All items were scored on a 1-to-5 scale, ranging
from needs work (1) through satisfactory (2-3) to excellent 
(4-5). The authors offered no guidance about how long an 
observation must be conducted before reliable judgments can 
be made, nor did they provide other evidence of technical 
adequacy. 
To these process-product checklists, Englert et al. added
additional items that “involve analyses of the qualitative di- 
mensions of instruction and the social contexts in which students 
are instructed” (1992, pp. 69-70). This enriched process- 
product conceptions by adding items derived from the follow- 
ing four principles of effective teaching: (a) instruction should 
be embedded in meaningful and purposive contexts; (b) class- 
room dialogue may be used to promote self-regulated learn- 
ing; (c) teachers must demonstrate responsiveness to students’ 
instructional needs and interests; and (d) in classroom learning 
communities, “student-to-student and teacher-to-teacher dis- 
course . . . foster deeper conceptual understandings” (p. 80). 
To incorporate these constructivist principles, Englert et al. 
added the Observation Checklist for Examining the Contexts 
for Higher-Order Learning, which corresponds to the four prin- 
ciples: meaningful contexts (4 items), classroom dialogues 
(11 items), responsive instruction (8 items), and classroom 
community (5 items). 

Stanovich and Jordan (1998) adapted Englert et al.’s 
checklists by identifying items that would be most readily and 
predictably observed in a half day, using 8 of Englert et al.’s 
items to constitute a classroom management scale, 8 to create a 
time management scale, and 11 to create a lesson presenta-
tion scale. They also added 4 items to assess the degree of in-
clusion. Trained observers rated teachers’ performance on these
31 items after a half-day observation. Items were scored as con-
sistent, inconsistent, or not observed. Total scores were used
as the criterion measure of effective teaching. All teachers were
rated by two observers, and agreement between observer pairs
averaged nearly 80%. Also, the English-Language Learner
Classroom Observation Instrument (Haager, Gersten, Baker, 
& Graves, 2003) was designed specifically for observing lit- 
eracy instruction by teachers of English-language learners. 
This instrument, whose roots also may be found in process- 
product research, was validated for research purposes and 
has value for classrooms with culturally diverse and English- 
language learners. 
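The scale composition and interobserver agreement described for the Stanovich and Jordan (1998) adaptation can be illustrated with a minimal sketch. The source does not say how agreement or total scores were computed, so item-by-item exact agreement and a count of "consistent" ratings are assumptions made here for illustration, and the two rating vectors are invented examples rather than study data.

```python
from typing import Sequence

# Item counts per scale, as reported in the text (8 + 8 + 11 + 4 = 31).
SCALE_SIZES = {
    "classroom management": 8,
    "time management": 8,
    "lesson presentation": 11,
    "inclusion": 4,
}
VALID_RATINGS = {"consistent", "inconsistent", "not observed"}


def percent_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Percentage of items on which two observers gave the same rating.

    Assumption: simple item-by-item exact agreement; the source does not
    specify the agreement index that was used.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("Both observers must rate the same non-empty item set.")
    if not (set(a) | set(b)) <= VALID_RATINGS:
        raise ValueError("Ratings must be consistent, inconsistent, or not observed.")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def total_score(ratings: Sequence[str]) -> int:
    """Assumed total score: the number of items rated 'consistent'."""
    return sum(r == "consistent" for r in ratings)


# Hypothetical ratings from two observers on the 31 items.
observer_a = ["consistent"] * 25 + ["inconsistent"] * 4 + ["not observed"] * 2
observer_b = ["consistent"] * 22 + ["inconsistent"] * 7 + ["not observed"] * 2

n_items = sum(SCALE_SIZES.values())
agreement = percent_agreement(observer_a, observer_b)
print(n_items, round(agreement, 1), total_score(observer_a))
```

In practice, a chance-corrected index might be preferred over raw percent agreement; the simple form is used here only because the text reports agreement as a percentage.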

Checklists of this sort lend themselves to the same kinds 
of teacher education studies as do process-product observa- 
tional measures, to which they are closely akin. These seem 
appropriate for use in comparative studies, experiments, and 
case studies of change. Stanovich and Jordan’s adaptation and 
use enhance the utility of the original checklists, which seem 
impractically long and elaborate for research purposes. Fur-
thermore, Stanovich and Jordan demonstrated that their ab-
breviated versions can be used reliably and that short-form
total scores were related to two criterion measures: teacher at- 
titudes and school culture. 

The Englert et al. paper was the most frequently cited 
paper to appear in Teacher Education and Special Education 
through 1995 (Tulbert et al., 1996), which suggests its strong 
credibility to an audience of teacher educators. By inten- 
tion, the full-length checklists are more comprehensive than 
process-product measures, as they include considerations of 
contextual factors, interactions, and community, which are no-
tably missing from behavioral observation systems and from
the abbreviated version used by Stanovich and Jordan. The 
checklists have wide applicability in assessing teachers of stu- 
dents with high-incidence disabilities. Englert’s important 
work advanced special education thinking about what consti- 
tutes effective teaching. The ideas she and her colleagues in- 
troduced a decade ago seemed quite radical indeed. At a 
practical level, however, the checklists have never been widely 
used for research purposes, Stanovich and Jordan’s work not- 
withstanding. 

Standards 

The Council for Exceptional Children (CEC) began prom- 
ulgating teaching standards in the early 1990s and in 2001 
published a revised edition of The CEC Standards for the 
Preparation of Special Educators. This document begins with 
narrative descriptions of 10 content standards: foundations, 
development and characteristics of learners, individual learn- 
ing differences, instructional strategies, learning environments 
and social interactions, communication, instructional plan- 
ning, assessment, professional and ethical practice, and col- 
laboration. Each content standard is then described in terms 
of the knowledge and skill competencies it comprises. 

Fifty-four knowledge and 72 skill statements make up 
the common core. Additional sets describe generic practice 
with students with high-incidence (individualized general cur- 
riculum) and severe (individualized independence curriculum) 
disabilities. Specialized practice is represented in six cate- 
gorical areas and two other areas (early childhood and transi- 
tion specialist) defined by the age levels of the students served 
and the nature of appropriate programming. A trainee who 
completes generic preparation for teaching students with high- 
incidence disabilities is expected to demonstrate proficiency 
on 126 competencies in the core as well as 42 knowledge and 
47 skill statements in the individualized general-curriculum- 
referenced standards. A teacher preparing to work with students 
with specific learning disabilities (SLD) must demonstrate 
proficiency on 174 competencies: the core plus 27 knowledge 
and 21 skill statements specific to SLD. 
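The competency totals in this paragraph follow from simple addition, which the short sketch below reproduces. The counts are taken from the text; the variable names are illustrative, and the combined total for the individualized general curriculum route (215) is derived here from the reported figures rather than stated in the source.

```python
# Knowledge and skill statement counts as reported in the text (CEC, 2001).
COMMON_CORE = {"knowledge": 54, "skill": 72}
GENERAL_CURRICULUM = {"knowledge": 42, "skill": 47}
SPECIFIC_LEARNING_DISABILITIES = {"knowledge": 27, "skill": 21}


def total_competencies(*statement_sets: dict) -> int:
    """Sum the knowledge and skill statements across the given sets."""
    return sum(s["knowledge"] + s["skill"] for s in statement_sets)


core_total = total_competencies(COMMON_CORE)
general_curriculum_total = total_competencies(COMMON_CORE, GENERAL_CURRICULUM)
sld_total = total_competencies(COMMON_CORE, SPECIFIC_LEARNING_DISABILITIES)

print(core_total, general_curriculum_total, sld_total)
```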

The CEC’s knowledge competencies are written as 
general, descriptive phrases. One statement from the common 
core — development and characteristics of learners — reads, “Ed-
ucational implications of characteristics of various excep-
tionalities” (CEC, 2001, “Common Core,” p. 1). Another from
the specialized knowledge base in mental retardation —
development and characteristics of learners — is “causes and
theories of intellectual disabilities and implications for pre-
vention” (CEC, 2001, “Mental Retardation/Developmental Dis-
abilities,” p. 2). Skill standards start with verbs and are like
knowledge statements in their generality. In fact, some skill 
descriptions are so general as to belie their use as categorical 
standards. For example, teachers of students with SLD are ex- 
pected to “use specialized methods for teaching basic skills” 
(CEC, 2001, “Learning Disabilities,” p. 1). Others are more 
precisely described and more clearly associated with a par-
ticular categorical area; for example, “demonstrate appropri-
ate body mechanics to assure student and teacher safety in
transfer, lifting, positioning, and eating” (CEC, 2001, “In- 
dependence Curriculum,” p. 3) seems quite specific to indi- 
vidualized independent-curriculum-referenced standards, 
learning environments, and social interactions. 

Whereas CEC knowledge and skill items are precisely 
defined, standards developed by the Interstate New Teacher 
Assessment and Support Consortium (INTASC) are fewer in 
number and more broadly conceived. INTASC standards are 
organized by the principle to which they are related. By virtue 
of the CEC’s effort to align their standards with those of IN- 
TASC, the 10 INTASC principles are roughly analogous to 
the CEC’s 10 content standards. 

The title of INTASC’s document for teachers working
with students with disabilities, Model Standards for Licens-
ing General and Special Education Teachers of Students With
Disabilities: A Resource for State Dialogue (2001), hints at its 
organization. Every principle is elaborated into standards for 
general and special education teachers and additional stan- 
dards for special education teachers only. INTASC standards, 
first released in 1992, were designed for compatibility with 
standards for accomplished practice promulgated by the Na-
tional Board for Professional Teaching Standards (NBPTS).
The special education initiative began in 1997. These standards, 
which were developed by a committee of general and special 
education teachers and teacher educators, include knowledge, 
skills, and dispositions that build on and are organized by the 
core principles. The purpose of these standards is to differen- 
tiate general from special education teachers’ roles, with ref- 
erence to (a) content (Principle 1); (b) pedagogy (Principles 
4-10); (c) knowledge of students with disabilities (Principles 
2 and 3); and (d) contexts (Principle 10). There are 49 stan- 
dards for both general education and special education teach- 
ers and an additional 49 for special education teachers. 

INTASC (2001) standards (a) emphasize that “teaching 
and learning are dynamic and interactive processes” (p. 2); 
(b) are responsive to students’ contexts; and (c) encourage users 
to take standards as a whole “to convey a complete picture of 
the acts of teaching and learning” (p. 2). Unlike the CEC stan- 
dards, INTASC knowledge, skills, and dispositions are not 
differentiated by teachers’ roles or students’ disability classi- 
fications. The statements are written in paragraph-length nar- 
ratives of complete sentences. Typically, a principle is broken 
down into elements, which are elaborated in the standards. For 
example, for Principle 3, “the teacher understands how stu- 
dents differ in their approaches to learning and creates instruc- 
tional opportunities that are adapted to diverse learners” (p. 17), 
the general and special education teacher standards include 
(a) building awareness of disability and respect for students 
with disabilities, (b) recognizing that students with disabili- 
ties make up a heterogeneous group, (c) understanding fami- 
lies’ perspectives on disabilities, and (d) recognizing that some 
differences may be mistaken for disability. 


Because neither the CEC nor INTASC standards offer 
an assessment process (although assessments are in the works), 
it is impossible to speak to the issue of their technical sound- 
ness or their utility as an approximation for student achieve- 
ment. At present, standards seem most useful for guiding the 
development of surveys of graduates in follow-up studies and 
in interviews used in longitudinal studies of change, perhaps 
associated with accreditation reviews. Standards have rarely 
been used as outcome measures in teacher education research; 
their use in Nevin, Thousand, Parsons, and Lilly (2000) is the 
only instance we found in the special education literature. How- 
ever, both the CEC and INTASC standards have the decided 
advantage of being fully comprehensive and general by de- 
sign — the CEC standards in a formal sense by differentiating 
knowledge and skill items and by roles. Both conceptions in- 
clude important work that teachers do outside the classroom. 
This work is reflected in the CEC’s collaboration standard and in INTASC’s Principle
10: “The teacher fosters relationships with school colleagues,
families, and agencies in the larger community to support stu-
dents’ learning and well being” (INTASC, 2001, p. 37). Both
the CEC and INTASC standards were developed over itera- 
tions with input from key stakeholders. 

For the moment, standards have limited potential as out- 
come measures in special education teacher education research, 
except as a guide to survey or interview development in follow- 
up or longitudinal research. At the same time, the conceptions 
of beginning teacher quality represented in these standards are 
detailed, coherent, and complete. The standards do represent 
contemporary professional thought but, unlike process-product 
measures, lack empirical connection to student outcomes. 

Large-Scale Surveys 

Questions in the SASS (n.d.) and SPeNSE (n.d.) teacher sur- 
veys also constitute representations of beginning teacher qual- 
ity. The SASS has been administered five times since 1987 by 
the National Center for Education Statistics (NCES), and
Boe and his colleagues have used the SASS data in analyses 
of special education teacher supply and demand (Boe, Bob- 
bitt, & Cook, 1997; Boe, Cook, Bobbitt, & Terhanian, 1998; 
Boe, Cook, Kaufman, & Danielson, 1996). The SASS sample 
taps the universe of public and private schools in the United 
States. The SPeNSE survey was administered once to a na- 
tionally representative sample of general and special educa- 
tion teachers. 

SASS. The SASS Teacher Questionnaire asks teachers 
to specify demographics, educational backgrounds, certifi- 
cation(s), and years of experience. Additional questions tap 
(a) length of practice teaching, (b) first-year duties and sup- 
ports, (c) mentoring, and (d) professional development. The 
questionnaire assesses job satisfaction, attitudes and percep- 
tions about support, influence in school, school safety, and be- 
havior. Teachers also are asked how well prepared they felt in 
their first year of teaching for management, instructional 
methods, technology, lesson planning, assessment, and selec-
tion and adaptation of instructional materials. Other questions
focus on professional development. 

SPeNSE. In the SPeNSE teacher survey, teachers were 
asked about preservice preparation and to indicate the num- 
ber of hours of professional development they received over 
the previous 12 months in each of 27 areas. Teachers also were
asked to indicate their degree of agreement on a Likert-type 
scale with statements such as “I am skillful in planning ef- 
fective lessons,” and “I am skillful in teaching reading or pre- 
reading skills.” Thus, for the 27 preparation areas, teachers 
indicated any preservice training, specified hours of profes- 
sional development, and judged their degree of mastery. 

SPeNSE included a second set of questions about pro- 
fessional development. Teachers were asked whether they had 
a personal professional development plan and whether they 
had participated in any of 12 professional development activ- 
ities over the past year. They indicated their hours spent in 
professional development and benefits of these experiences 
(e.g., Improved your effectiveness as a teacher? Been respon-
sive to your professional development needs?). This section
of the survey ended with six more questions related to mentor- 
ing, contacts with teachers and other education professionals, 
reading professional journals, and association membership. 

Representations of beginning teacher quality, like SASS 
and SPeNSE, clearly are intended for use in follow-up survey 
research. Their utility is evident, and these surveys have been 
used with both general and special education teachers and 
across special education contexts. The entire research genre is 
practical in that extensive data may be generated relatively in- 
expensively and relatively quickly. Because SASS and SPeNSE 
have been validated by use in previous research, these surveys 
are presumed to be technically sound, although the concep- 
tions of beginning teacher quality that can be inferred are 
sketchy and incomplete relative to other potential measures 
(e.g., standards). Generally, the credibility of surveys like these 
is limited by the self-report format and its potential for inac- 
curacy and bias. 

Commercially Available Observation 
Systems: PRAXIS III 

PRAXIS III (Dwyer, 1993, 1994) is “a system for assessing the 
teaching skills of beginning teachers” (Dwyer, 1998, p. 163). 
(PRAXIS I is a test of enabling skills such as reading, writ- 
ing, and arithmetic, and PRAXIS II is a test of subject matter 
knowledge and teaching principles.) The 19 PRAXIS criteria, 
which are organized into four domains, were developed from 
research (Reynolds, 1992), job analyses, and a multistate va- 
lidity study. The criteria were piloted in the field and refined 
in collaboration with practicing teachers. During its develop- 
ment, PRAXIS III went through five iterations. 

The development process, begun in 1987 and completed 
in 1993, involved (a) establishing an underlying conception
of teaching, (b) developing a plan for defining teaching, and
(c) linking this definition to assessments. The underlying con- 
ception of teaching emphasizes the importance of action and 
decision making and the consideration of individual, school, 
and community contexts. Because learning is presumed to in- 
volve the active construction of knowledge, assessments must 
take place in classrooms. Teachers are afforded opportunities 
to explain their actions, and scoring allows for the reality that 
good teaching can take many forms. Skilled professionals are 
thought to make the best assessors. The 19 criteria are or- 
ganized into four domains: (a) organizing content knowledge 
for student learning, (b) creating an environment for student 
learning, (c) teaching for student learning, and (d) teacher pro- 
fessionalism. 

Dwyer (1998) asserted that these criteria establish “a vi- 
sion of teaching . . . derived from working closely with teach- 
ers themselves . . . relevant to teachers’ own practice and 
concerns . . . [and] informed by the theoretical and policy 
perspectives of other educators and researchers” (p. 172). 
PRAXIS III involves three data collection processes: (a) di- 
rect observation of classroom practice (in which a running 
narrative is kept), (b) written materials (class and teacher pro- 
files and a lesson plan), and (c) interviews (before and after the 
observation) related to the lesson. 

Trained assessors observe teachers as they teach a lesson 
of their choice to a group of their choice. Using the full record 
of evidence — the two profiles, lesson plan, running observa- 
tional record, and interview protocols — assessors rate teach- 
ers on the 19 criteria. The scale used is from 1.0 to 3.5; a rating 
of 2.0 represents minimally satisfactory performance. Scor- 
ing is guided by a rubric linked to the nature of the evidence. 
Assessor training, which requires 5 days, is considered es- 
sential for identifying evidence relating to criteria and to using 
evidence to reach judgment. 
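The rating scale just described can be sketched as a simple check. The scale bounds (1.0 to 3.5), the 2.0 threshold for minimally satisfactory performance, and the 19 criteria come from the text; the function name and the example ratings are hypothetical, and the source does not describe how ratings are aggregated, so no aggregation is attempted here.

```python
from typing import List, Sequence

MIN_RATING = 1.0            # lowest rating on the scale
MAX_RATING = 3.5            # highest rating on the scale
MINIMALLY_SATISFACTORY = 2.0
N_CRITERIA = 19


def criteria_below_satisfactory(ratings: Sequence[float]) -> List[int]:
    """Return the 1-based numbers of criteria rated below 2.0."""
    if len(ratings) != N_CRITERIA:
        raise ValueError(f"Expected {N_CRITERIA} ratings, got {len(ratings)}.")
    for r in ratings:
        if not MIN_RATING <= r <= MAX_RATING:
            raise ValueError(f"Rating {r} is outside the 1.0 to 3.5 scale.")
    return [i for i, r in enumerate(ratings, start=1) if r < MINIMALLY_SATISFACTORY]


# Hypothetical ratings for one teacher; not data from any study.
example_ratings = [2.5] * 17 + [1.5, 2.0]
flagged = criteria_below_satisfactory(example_ratings)
print(flagged)
```

A rating of exactly 2.0 is treated as satisfactory, in keeping with the text's description of 2.0 as minimally satisfactory performance.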

Observation systems like PRAXIS III may be used in 
comparative studies and in longitudinal studies of change. The 
generality of its criteria precludes its use in experiments unless
ratings are conducted on a pre-post basis (as for process- 
product measures) and the intervention is designed specifically 
to affect performance on one or more criteria. The conception 
of teacher quality is highly credible, given the systematic 
manner with which it was developed and the participation of 
key stakeholders throughout. However, as an assessment for 
classroom teachers, PRAXIS III made no special adaptations 
for special education practice. 

PRAXIS III has been used in one special education study 
(Sindelar et al., 2004). In this study, the system worked well
with a sample of special education teachers, prompting the 
authors to conclude that PRAXIS III “provide[d] a clear and 
coherent picture of the competence of teachers . . . despite 
the fact that Praxis III was designed to assess general educa- 
tion teachers” (p. 222). They noted that the pattern of perfor- 
mance across the 19 criteria supported this assertion in the 
sense that their sample of special education teachers per- 
formed relatively well where expected and relatively less well 
where expected (e.g., “extends students’ thinking”). Ratings
on six criteria and two domain summary scores differentiated 
graduates of three distinct teacher education program types. 

Thus, PRAXIS III rates high marks on utility and cred- 
ibility. Furthermore, the richness of the record of evidence 
creates a comprehensive picture of beginning teacher quality. 
With regard to soundness, Dwyer (1998) emphasized its con- 
struct validity and argued that construct validity was the most 
important consideration for teacher observation systems. Fur- 
thermore, ETS developed PRAXIS to market to states as a 
legally defensible process for licensing beginning teachers. 
The reliability of assessors’ ratings is implied by the exten- 
siveness of their training. However, PRAXIS III adminis- 
tration is highly costly and labor-intensive. Training is also 
costly and, for longitudinal studies, would need to be repeated 
for new assessors. Although the picture of teaching compe- 
tence derived from PRAXIS III observations is rich and sound, 
the system may be impractical for some purposes. 

Summary and Recommendations 

Teacher quality means different things to different people. 
Moreover, different people use models and measures of teacher 
quality differently depending on their purposes. On one hand, 
a researcher may be willing to use measures that take time and 
are more difficult to administer because of an interest in un- 
derstanding deeply many dimensions of teacher quality. On 
the other hand, a policymaker may want to acquire informa- 
tion quickly and efficiently and thus will call on measures that 
will accomplish this purpose. For example, defining begin- 
ning teacher quality as being fully certified is likely to have 
greater credibility among policymakers than among research- 
ers. Even within the community of researchers who study 
teacher quality, there is no single definition or measure for be- 
ginning or experienced teachers, either in general education 
or in special education. 

As inquiry into teaching and teacher education has grown 
and matured, in both general education and special education, 
models and measures of teacher quality have evolved. Yet, in 
special education, research on teaching (e.g., process-product 
studies) has focused most often on teachers of students with 
high-incidence disabilities, and, in this article, we focused on 
the same group. Our analysis persuades us that the models and 
measures we have presented are useful in studying this sub- 
set of special education teachers. A review of the literature re- 
lated to beginning teachers serving culturally diverse and 
English-language learners (Blanton et al., 2003) led us to con-
clude that the models and measures are also useful for exam-
ining teacher quality in this area.

In this paper, we identified classes of models and mea- 
sures, presented illustrative examples of each class, considered 
research genres for which each class would be appropriate, 
and discussed their merits using evaluation criteria. These 
analyses, as summarized in Table 1, lead to a single, irrefutable
conclusion: The superiority of one model over another de-
pends on the purpose and context of its use. For most purposes,
the best approach would be to pick and choose from several
models — and, indeed, using multiple measures in teacher ed-
ucation research is encouraged by most authorities (Wilson,
Floden, & Ferrini-Mundy, 2002). In special education, there
is a great need to accelerate research on beginning teacher 
quality by drawing on models and measures set forth in this 
paper. Our analysis suggests that special education teacher ed- 
ucation research should be guided by these considerations: 

1. Use multiple research traditions in conducting
research on teacher quality and multiple mea- 
sures associated with those traditions. With the 
exception of process-product research, special 
education has produced only a handful of re- 
search studies drawn from other research tra- 
ditions. This fact alone calls out to special 
educators to expand research on teacher quality 
to include programs of research focused on un- 
derstanding the complexity of teachers’ actions 
and interactions with students and contexts. 

2. Get the attention of policymakers by producing 
compelling research findings and by linking 
measures of teacher quality to student out- 
comes. Because policymakers need measures 
of teacher quality to communicate with the 
public, we offer two recommendations to spe-
cial education researchers. First, it is critical
that the special education research community 
take research on teacher quality seriously. We 
must accumulate findings that policymakers 
find credible and that distinguish our field 
from general education. 

A second facet of this problem derives 
from the fact that the difficulties special educa- 
tion researchers have in linking teachers’ com- 
petence to student achievement may be too 
esoteric for policymakers to appreciate fully. 

For example, because the vast majority of stu-
dents with disabilities spend the majority of
the school day in general education classes, 
more than one teacher contributes to their aca- 
demic growth (unlike, say, a typical third 
grader). In the typical special education con- 
text, determining which teacher is responsible 
for how much growth may be a practical im-
possibility. Furthermore, even if researchers
could successfully parse achievement by 
teacher, their ability to relate achievement 
growth to teacher quality would be diluted by 
the smaller effect sizes associated with individ- 
ual teachers. Second, special education teach- 
ers work with students whose abilities and 
rates of progress vary widely, and they are un-
likely to have an adequate number of students
at roughly the same level of ability so as to 
make prediction feasible — as would a typical 
third-grade teacher, for instance. Furthermore, 
whereas in a typical elementary school there 
are three or four third-grade teachers, each 
working with children at about the same ability 
level, schools are likely to have one or two 
special education teachers, who probably di- 
vide their work by grade level or by the sever- 
ity of their students’ disabilities. To find four 
special education teachers all working with 
children at the same level of ability, then, re- 
searchers would have to be in four schools, 
and even then there would be no guarantee 
that children in those four schools were served 
in the same delivery model, or for the same 
length of time, or in the same curriculum. 

Thus, the difficult logistics and high cost of 
conducting research in schools are multiplied 
for special education researchers. Our job is to 
educate policymakers about the complexity of 
teaching and learning in special education con- 
texts so that policy becomes something more 
than a standard solution imposed on distinct 
problems. 

3. Validate assessments based on teaching stan- 
dards. We noted earlier in our paper that as- 
sessments have not been developed, or are in 
very early stages of development, for the CEC 
and INTASC standards. Other groups, most 
notably the NBPTS, are farther along on the 
validation process, and we believe the CEC 
and INTASC would be well advised to follow 
NBPTS’s lead by developing research pro-
grams to validate their assessment processes.
We found three validation studies of the 
NBPTS assessment process (Bond, Smith, 
Baker, & Hattie, 2000; Goldhaber & Anthony, 
2004; Vandevoort, Amrein-Beardsley, & Ber-
liner, 2004). In these studies, NBPTS-certified
teachers were compared with non-NBPTS-
certified teachers (Goldhaber & Anthony,
2004; Vandevoort et al., 2004) or with teachers
who had applied for certification but were
turned down (Bond et al., 2000). Bond and 
colleagues studied teachers who applied for 
certification in either the early adolescence lan- 
guage arts or middle childhood generalist cate- 
gory, whereas both Goldhaber and Anthony 
(2004) and Vandevoort et al. (2004) studied el- 
ementary teachers. These researchers found 
that students of NBPTS-certified teachers typi-
cally outperformed students of comparison
group teachers on measures of academic
achievement (Goldhaber & Anthony, 2004;
Vandevoort et al., 2004) or the quality of stu- 
dent work (Bond et al., 2000). The teachers in 
the Bond et al. study consistently outperformed 
comparison teachers on 13 measures of teach-
ing excellence.

The questions such studies address are important ones, 
but knowing that the NBPTS (or INTASC or CEC) assessment 
process validly differentiates good teachers from outstanding 
ones has limited utility in the context of teacher shortages. In 
special education, it may be more important for assessment 
systems to reliably differentiate basically competent from in- 
competent teachers. This distinction seems appropriate for 
INTASC standards, which are designed for new teachers, but 
it is unclear as to where the CEC stands on this issue. In our 
judgment, because of the leadership role that the CEC plays 
in the field of special education and until shortages are ad- 
dressed, that organization should concentrate on developing 
an assessment process for novice teachers and establishing its 
validity in identifying basically competent ones. 

Research on teacher quality is as challenging as it is im- 
portant. Policy is in place that most special education teacher 
educators regard as detrimental to teachers’ professional de- 
velopment and students’ success in school. A high standard 
has been set for the credibility of evidence that policymakers 
will consider. Historically, teacher education researchers have 
endeavored to link variation in preparation to variation in 
teacher competence and, in doing so, did not restrict outcomes 
to a student achievement standard. In fact, Kennedy (1999), 
among others, has argued that direct observation is a better 
measure of teacher quality than achievement test scores, pro- 
vided that the outcome of interest is complex student learn- 
ing. Nonetheless, in the current policy context, scholars would 
be naive to ignore student outcomes. Thus, future research in 
our field must focus on the validation of measures of teacher 
quality. Only then will researchers have the tools they need to 
link preparation variables to credible measures of teacher 
quality. Only then will they garner the attention of policy- 
makers. 